<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<HTML>
<HEAD>
	<META HTTP-EQUIV="CONTENT-TYPE" CONTENT="text/html; charset=iso-8859-1">
	<TITLE>Grid Engine Common Problems</TITLE>
	<META NAME="GENERATOR" CONTENT="StarOffice 6.0  (Solaris Sparc)">
	<META NAME="AUTHOR" CONTENT=" ">
	<META NAME="CREATED" CONTENT="20020111;13083600">
	<META NAME="CHANGED" CONTENT="20030826;9410500">
	<STYLE>
	<!--
		@page { size: 21.59cm 27.94cm }
	-->
	</STYLE>
</HEAD>
<BODY LANG="en-US">
<H1><FONT COLOR="#336699"><FONT SIZE=4 STYLE="font-size: 16pt"><B>Common
problems using Grid Engine</B></FONT></FONT></H1>
<P STYLE="margin-bottom: 0cm">Last updated: <SDFIELD TYPE=DATETIME SDNUM="1023;1033;MMM D, YYYY">Aug 26, 2003</SDFIELD></P>
<P STYLE="margin-bottom: 0cm">The present HOWTO goes over some
commonly seen problems experienced when using Grid Engine, and
appropriate solutions. The information is presented in a tabular
chart, using the following scheme:</P>
<P STYLE="margin-bottom: 0cm"><BR>
</P>
<TABLE WIDTH=288 BORDER=1 BORDERCOLOR="#000000" CELLPADDING=4 CELLSPACING=0 STYLE="page-break-inside: avoid">
	<COL WIDTH=136>
	<COL WIDTH=134>
	<THEAD>
		<TR>
			<TH COLSPAN=2 WIDTH=278 VALIGN=TOP BGCOLOR="#ff8080">
				<P><I>Category</I></P>
			</TH>
		</TR>
	</THEAD>
	<TBODY>
		<TR>
			<TH COLSPAN=2 WIDTH=278 VALIGN=TOP BGCOLOR="#ffcc99">
				<P><I>Symptom</I></P>
			</TH>
		</TR>
		<TR VALIGN=TOP>
			<TH WIDTH=136>
				<P><I>Cause</I></P>
			</TH>
			<TH WIDTH=134>
				<P><I>Resolution</I></P>
			</TH>
		</TR>
	</TBODY>
</TABLE>
<P STYLE="margin-bottom: 0cm">For problems which are not explicitly
mentioned here, search for a symptom in the appropriate category
which matches your problem as closely as possible, and see if the
resolution fixes your particular case.</P>
<H3>Categories:</H3>
<UL>
	<LI><P STYLE="margin-bottom: 0cm"><A HREF="#batch">Batch Submit</A></P>
	<LI><P STYLE="margin-bottom: 0cm"><A HREF="#monitoring">Monitoring</A></P>
	<LI><P STYLE="margin-bottom: 0cm"><A HREF="#miscerrmsg">Miscellaneous
	Error Messages</A></P>
	<LI><P STYLE="margin-bottom: 0cm"><A HREF="#performance">Performance</A></P>
	<LI><P STYLE="margin-bottom: 0cm"><A HREF="#configuration">Configuration</A></P>
	<LI><P STYLE="margin-bottom: 0cm"><A HREF="#interactive">Qrsh/Interactive
	Jobs</A></P>
	<LI><P STYLE="margin-bottom: 0cm"><A HREF="#qmake">Qmake</A></P>
	<LI><P STYLE="margin-bottom: 0cm"><A HREF="#qmon">Qmon</A></P>
	<LI><P STYLE="margin-bottom: 0cm"><A HREF="#pe-ckpt">Parallel/Checkpointing</A></P>
	<LI><P STYLE="margin-bottom: 0cm"><A HREF="#shadow">Shadow Facility</A></P>
</UL>
<P STYLE="margin-bottom: 0cm"><BR>
</P>
<HR>
<P STYLE="margin-bottom: 0cm"><BR>
</P>
<TABLE WIDTH=100% BORDER=1 BORDERCOLOR="#000000" CELLPADDING=4 CELLSPACING=0 STYLE="page-break-inside: avoid">
	<COL WIDTH=128*>
	<COL WIDTH=128*>
	<THEAD>
		<TR>
			<TH COLSPAN=2 WIDTH=100% VALIGN=TOP BGCOLOR="#ff8080">
				<P><A NAME="batch"></A>Batch Submit</P>
			</TH>
		</TR>
	</THEAD>
	<TBODY>
		<TR>
			<TD COLSPAN=2 WIDTH=100% VALIGN=TOP BGCOLOR="#ffcc99">
				<P>My output file for my job says 
				</P>
				<PRE>&quot;Warning: no access to tty; thus no job control in this shell...&quot;</PRE>
			</TD>
		</TR>
		<TR VALIGN=TOP>
			<TD WIDTH=50%>
				<P>One or more of your login files contain an stty command. These
				commands are only useful if there is a terminal present. 
				</P>
			</TD>
			<TD WIDTH=50%>
				<P>In Sun Grid Engine batch jobs, there is no termional
				associated with these jobs. You need to either remove all stty
				commands from your login files, or bracket them with an 'if'
				statement which checks for a terminal before processing. An
				example of this is below: 
				</P>
				<PRE>/bin/csh:
stty -g            # checks terminal status
if ($status == 0)  # succeeds if a terminal is present
&lt;place all stty commands in here&gt;
endif</PRE>
			</TD>
		</TR>
		<TR>
			<TD COLSPAN=2 WIDTH=100% VALIGN=TOP BGCOLOR="#ffcc99">
				<P STYLE="margin-bottom: 0cm"><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>The
				job standard error log file says:</FONT></FONT></FONT></P>
				<PRE STYLE="margin-bottom: 0.5cm">`tty`: Ambiguous</PRE><P>
				<FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>but
				there is no reference to tty in user's shell which is called in
				job script</FONT></FONT></FONT></P>
			</TD>
		</TR>
		<TR VALIGN=TOP>
			<TD WIDTH=50%>
				<P><FONT FACE="Thorndale, serif"><FONT COLOR="#000000">shell_start_mode
				is by default posix_compliant; therefore, all job scripts run
				with the shell specified in the queue definition, not the one
				specified on the first line of the job script</FONT></FONT>.</P>
			</TD>
			<TD WIDTH=50%>
				<P><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>use
				the -S flag to qsub, or change shell_start_mode to unix_behavior</FONT></FONT></FONT></P>
			</TD>
		</TR>
		<TR>
			<TD COLSPAN=2 WIDTH=100% VALIGN=TOP BGCOLOR="#ffcc99">
				<P><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>I
				can run my job script from the command line, but it fails when
				run via qsub.</FONT></FONT></FONT></P>
			</TD>
		</TR>
		<TR VALIGN=TOP>
			<TD WIDTH=50%>
				<P><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>Process
				limits could be being set for your job. To test this, write a
				test script which does limit and limit -h. Execute both
				interactively at the shell prompt and through qsub to compare the
				result.</FONT></FONT></FONT></P>
			</TD>
			<TD WIDTH=50%>
				<P><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>make
				sure to remove any commands in configuration files which sets
				limits in your shell.</FONT></FONT></FONT></P>
			</TD>
		</TR>
		<TR>
			<TD COLSPAN=2 WIDTH=100% VALIGN=TOP BGCOLOR="#ffcc99">
				<P><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>qsub
				of a job results in the error &quot;can't set additional group id
				for job&quot; (seen in administrator or user mail, or shepherd
				trace file) and puts queue into error state</FONT></FONT></FONT></P>
			</TD>
		</TR>
		<TR VALIGN=TOP>
			<TD WIDTH=50%>
				<P><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>Possible
				reasons</FONT></FONT></FONT></P>
				<OL>
					<LI><P><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>The
					error message below can occur if the user already have 16
					existing group ids set. SGE tries to set one more group id and
					fails b/c usually the limit is 16.</FONT></FONT></FONT></P>
					<LI><P><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>If
					you are not running Grid Engine as root, then the setgroups()
					command will fail trying to set the unique group ID which is
					used to track all the spawned processes of a job.</FONT></FONT></FONT></P>
				</OL>
			</TD>
			<TD WIDTH=50%>
				<OL>
					<P STYLE="margin-bottom: 0cm"><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>Corresponding
					solutions</FONT></FONT></FONT></P>
					<LI><P STYLE="margin-bottom: 0cm"><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>Please
					check to see how many group ids are assigned to the user using
					'id -a'. If it's more than 16, then you need to reduce this
					number or increase the limit in the kernel (NGROUPS_MAX).</FONT></FONT></FONT></P>
					<LI><P><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>Be
					sure to run the Grid Engine daemons as root.</FONT></FONT></FONT></P>
				</OL>
			</TD>
		</TR>
		<TR>
			<TD COLSPAN=2 WIDTH=100% VALIGN=TOP BGCOLOR="#ffcc99">
				<P><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>Jobs
				work when run from command line but fail when run via qsub</FONT></FONT></FONT></P>
			</TD>
		</TR>
		<TR VALIGN=TOP>
			<TD WIDTH=50%>
				<P><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>Data
				and executables may not be accessible where needed</FONT></FONT></FONT></P>
			</TD>
			<TD WIDTH=50%>
				<P><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>The
				jobs script itself must be accessible from the submit host. All
				data and other executables needed by the script must be
				accessible on the execute host. Usually shared via NFS.</FONT></FONT></FONT></P>
			</TD>
		</TR>
		<TR VALIGN=TOP>
			<TD WIDTH=50%>
				<P><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>Unlimited
				stack size set by default by SGE may cause some apps to crash on
				some OS's.</FONT></FONT></FONT></P>
			</TD>
			<TD WIDTH=50%>
				<P>In the job script, use &ldquo;ulimit&rdquo; to set stack size
				limits before calling the executable that crashes.</P>
				<P>Or modify the queue to set smaller stack size:</P>
				<PRE>qconf -mattr queue h_stack 8389486 &lt;queue_name&gt; (hard limit in bytes)
qconf -mattr queue s_stack 8389486 &lt;queue_name&gt; (soft limit in bytes)</PRE>
			</TD>
		</TR>
		<TR>
			<TH COLSPAN=2 WIDTH=100% VALIGN=TOP BGCOLOR="#ff8080">
				<P><A NAME="monitoring"></A>Monitoring</P>
			</TH>
		</TR>
		<TR>
			<TD COLSPAN=2 WIDTH=100% VALIGN=TOP BGCOLOR="#ffcc99">
				<P><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>Exec
				hosts report a load of 99.99; queue is in &ldquo;alarm&rdquo;
				and/or &ldquo;unknown&rdquo; state</FONT></FONT></FONT></P>
			</TD>
		</TR>
		<TR VALIGN=TOP>
			<TD WIDTH=50%>
				<P STYLE="margin-bottom: 0cm"><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>There
				are a few things that could cause your exec hosts to report a
				load of 99.99:</FONT></FONT></FONT></P>
				<P STYLE="margin-bottom: 0cm"><BR>
				</P>
				<OL>
					<LI><P STYLE="margin-bottom: 0cm"><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>The
					execd is not running on the host.</FONT></FONT></FONT></P>
					<LI><P STYLE="margin-bottom: 0cm"><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>A
					default domain is incorrectly specified</FONT></FONT></FONT></P>
					<LI><P><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>The
					qmaster host sees the exec host as a different name as the exec
					host sees itself.</FONT></FONT></FONT></P>
				</OL>
			</TD>
			<TD WIDTH=50%>
				<P STYLE="margin-bottom: 0cm"><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>Depending
				on the cause, here are the appropriate solutions</FONT></FONT></FONT></P>
				<P STYLE="margin-bottom: 0cm"><BR>
				</P>
				<OL>
					<LI><P STYLE="margin-bottom: 0cm"><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>Start
					up the execd as root on the host by running the
					$SGE_ROOT/default/common/rcsge script </FONT></FONT></FONT>
					</P>
					<LI><P STYLE="margin-bottom: 0cm"><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>Run
					'qconf -mconf' as the Sun Grid Engine administrator and change
					the default_domain to none. </FONT></FONT></FONT>
					</P>
					<LI><P STYLE="margin-bottom: 0cm"><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>Set
					<FONT SIZE=2>IGNORE_FQDN=TRUE </FONT>for qmaster_params in
					cluster configuration.</FONT></FONT></FONT></P>
					<LI><P STYLE="margin-bottom: 0cm"><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>See
					man page sge_h_aliases(5)</FONT></FONT></FONT></P>
					<LI><P><FONT FACE="Thorndale, serif"><FONT COLOR="#000000">Please
					see the AppNote/HOWTO <A HREF="http://supportforum.sun.com/gridengine/appnote_loadinfo.html" TARGET="_child">loadinfo</A>
					for more information. </FONT></FONT>
					</P>
				</OL>
			</TD>
		</TR>
		<TR>
			<TH COLSPAN=2 WIDTH=100% VALIGN=TOP BGCOLOR="#ff8080">
				<P><A NAME="miscerrmsg"></A>Miscellaneous Error Messages</P>
			</TH>
		</TR>
		<TR>
			<TD COLSPAN=2 WIDTH=100% VALIGN=TOP BGCOLOR="#ffcc99">
				<P STYLE="margin-bottom: 0cm"><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>A
				warning is printed to &lt;cell&gt;/spool/&lt;host&gt;/messages
				every 30 seconds. The messages look like this: </FONT></FONT></FONT>
				</P>
				<P STYLE="margin-bottom: 0cm"><BR>
				</P>
				<PRE STYLE="margin-bottom: 0.5cm">Tue Jan 23 21:20:46 2001|execd|meta|W|local 
configuration meta not defined - using global configuration </PRE><P>
				<FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>But
				there IS a file for each host in &lt;cell&gt;/common/local_conf/,
				each with FDQN.</FONT></FONT></FONT></P>
			</TD>
		</TR>
		<TR VALIGN=TOP>
			<TD WIDTH=50%>
				<P><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>The
				hostname resolving at your machine &quot;meta&quot; returnes the
				short name, while at your master machine &quot;meta&quot; with
				FQDN is returned. </FONT></FONT></FONT>
				</P>
			</TD>
			<TD WIDTH=50%>
				<P STYLE="margin-bottom: 0cm"><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>Make
				sure that all of your /etc/hosts files and your NIS table are
				consistent with this respect. In this example, there could be a
				line like</FONT></FONT></FONT></P>
				<P STYLE="margin-bottom: 0cm"><BR>
				</P>
				<PRE STYLE="margin-bottom: 0.5cm">168.0.0.1 meta meta.your.domain</PRE><P STYLE="margin-bottom: 0cm">
				<FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>in
				/etc/hosts of the host &quot;meta&quot; while it should be
				instead</FONT></FONT></FONT></P>
				<P STYLE="margin-bottom: 0cm"><BR>
				</P>
				<PRE>168.0.0.1 meta.your.domain meta</PRE>
			</TD>
		</TR>
		<TR>
			<TD COLSPAN=2 WIDTH=100% VALIGN=TOP BGCOLOR="#ffcc99">
				<P>Occasionally I see &quot;CHECKSUM ERROR&quot;, &quot;WRITE
				ERROR&quot; or &quot;READ ERROR&quot; messages in the &quot;messages&quot;
				files of the daemons. Do I need to worry about these?</P>
			</TD>
		</TR>
		<TR VALIGN=TOP>
			<TD WIDTH=50%>
				<P><BR>
				</P>
			</TD>
			<TD WIDTH=50%>
				<P>As long as these messages do not appear in a one second
				interval (they<BR>typically may appear between 1-30 times per
				day), there is no need<BR>to do anything on this issue.</P>
			</TD>
		</TR>
		<TR>
			<TD COLSPAN=2 WIDTH=100% VALIGN=TOP BGCOLOR="#ffcc99">
				<P STYLE="margin-bottom: 0cm"><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>Jobs
				will finish on a particular queue: qmaster/messages:</FONT></FONT></FONT></P>
				<PRE STYLE="margin-bottom: 0.5cm">Wed Mar 28 10:57:15 2001|qmaster|masterhost|I|job 490.1 finished on host exechost </PRE><P STYLE="margin-bottom: 0cm">
				<FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>But
				then the following errors are seen on the exec host: </FONT></FONT></FONT>
				</P>
				<P STYLE="margin-bottom: 0cm"><BR>
				</P>
				<P STYLE="margin-bottom: 0cm"><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>exechost/messages:</FONT></FONT></FONT></P>
				<P STYLE="margin-bottom: 0cm"><BR>
				</P>
				<PRE STYLE="margin-bottom: 0.5cm">Wed Mar 28 10:57:15 2001|execd|exechost|E|can't find directory 
&quot;active_jobs/490.1&quot; for reaping job 490.1</PRE><P STYLE="margin-bottom: 0cm">
				<BR>
				</P>
				<P STYLE="margin-bottom: 0cm"><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>exechost/messages:</FONT></FONT></FONT></P>
				<P STYLE="margin-bottom: 0cm"><BR>
				</P>
				<PRE>Wed Mar 28 10:57:15 2001|execd|exechost|E|can't remove directory 
&quot;active_jobs/490.1&quot;: opendir(active_jobs/490.1) failed: Input/output error </PRE>
			</TD>
		</TR>
		<TR VALIGN=TOP>
			<TD WIDTH=50%>
				<P><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>The
				$SGE_ROOT directory, which is automounted, is being unmounted,
				causing the sge_execd to lose its cwd.</FONT></FONT></FONT></P>
			</TD>
			<TD WIDTH=50%>
				<P><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>Use
				a local spool directory for you execd host. Set the parameter
				execd_spool_dir using qmon or qconf.</FONT></FONT></FONT></P>
			</TD>
		</TR>
		<TR>
			<TD COLSPAN=2 WIDTH=100% VALIGN=TOP BGCOLOR="#ffcc99">
				<PRE>&ldquo;critical error: can't connect commd&rdquo;
&ldquo;critical error: setup failed starting cod_schedd&rdquo;</PRE>
			</TD>
		</TR>
		<TR VALIGN=TOP>
			<TD WIDTH=50%>
				<P>A bug on 32 bit systems: <FONT SIZE=2>rlim_fd_max &gt; 1024 </FONT><FONT FACE="Thorndale, serif"><FONT COLOR="#000000">in
				/etc/system</FONT></FONT></P>
			</TD>
			<TD WIDTH=50%>
				<P><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>Set
				rlim_fd_max to &lt; 1024. Or update to SGE 5.3p2 or higher</FONT></FONT></FONT></P>
			</TD>
		</TR>
		<TR VALIGN=TOP>
			<TD WIDTH=50%>
				<P>The actual hostname &lt;myhostname&gt; of the machine is in
				alias to /localhost in etc/hosts. Looks like this:</P>
				<PRE>127.0.0.1   localhost  myhostname</PRE>
			</TD>
			<TD WIDTH=50%>
				<P>remove &lt;myhostname&gt; as an alias to localhost and put
				&lt;myhostname&gt; after the real IP-address in /etc/hosts</P>
			</TD>
		</TR>
		<TR>
			<TD COLSPAN=2 WIDTH=100% VALIGN=TOP BGCOLOR="#ffcc99">
				<P>Multiple queues cascade into error state, rendering the grid
				unusable. 
				</P>
			</TD>
		</TR>
		<TR VALIGN=TOP>
			<TD WIDTH=50%>
				<P>errors in a user's .cshrc/.profile result in setting all
				queues in error state</P>
			</TD>
			<TD WIDTH=50%>
				<OL>
					<LI><P>Fix errors in users' .cshrc/.profile</P>
					<LI><P>Use the -f option in the first line of the jobscript
					(i.e. Use &ldquo;!#/bin/sh -f&rdquo;) to bypass users' .cshrc or
					.profile</P>
				</OL>
			</TD>
		</TR>
		<TR>
			<TH COLSPAN=2 WIDTH=100% VALIGN=TOP BGCOLOR="#ff8080">
				<P><A NAME="performance"></A>Performance</P>
			</TH>
		</TR>
		<TR>
			<TD COLSPAN=2 WIDTH=100% VALIGN=TOP BGCOLOR="#ffcc99">
				<P><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>Memory
				leak and huge memory consumption for schedd on large systems</FONT></FONT></FONT></P>
			</TD>
		</TR>
		<TR VALIGN=TOP>
			<TD WIDTH=50%>
				<P><FONT FACE="Thorndale, serif"><FONT COLOR="#000000">Parameter
				<CODE><FONT SIZE=2>sched_job_info=true</FONT></CODE></FONT></FONT></P>
			</TD>
			<TD WIDTH=50%>
				<P><FONT FACE="Thorndale, serif"><FONT COLOR="#000000">Set
				<CODE><FONT SIZE=2>sched_job_info= false</FONT></CODE> or update
				to release 5.3p3 or higher</FONT></FONT></P>
			</TD>
		</TR>
		<TR>
			<TH COLSPAN=2 WIDTH=100% VALIGN=TOP BGCOLOR="#ff8080">
				<P><A NAME="configuration"></A>Configuration</P>
			</TH>
		</TR>
		<TR>
			<TD COLSPAN=2 WIDTH=100% VALIGN=TOP BGCOLOR="#ffcc99">
				<P><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>max_u_jobs
				doesn't work as expected.</FONT></FONT></FONT></P>
			</TD>
		</TR>
		<TR VALIGN=TOP>
			<TD WIDTH=50%>
				<P><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>It
				doesn't work exactly the same way in all versions of the product
				&ndash; and affects scheduling differently depending on whether
				the product is used in SGE or SGEEE mode. </FONT></FONT></FONT>
				</P>
			</TD>
			<TD WIDTH=50%>
				<P><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>Update
				to SGE 5.3p2 (or higher) which contains the latest
				implementation. </FONT></FONT></FONT>
				</P>
			</TD>
		</TR>
		<TR>
			<TH COLSPAN=2 WIDTH=100% VALIGN=TOP BGCOLOR="#ff8080">
				<P><A NAME="interactive"></A>Qrsh/Interactive Jobs</P>
			</TH>
		</TR>
		<TR>
			<TD COLSPAN=2 WIDTH=100% VALIGN=TOP BGCOLOR="#ffcc99">
				<P STYLE="margin-bottom: 0cm"><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>Submitting
				interactive jobs with qrsh, I get the error: </FONT></FONT></FONT>
				</P>
				<P STYLE="margin-bottom: 0cm"><BR>
				</P>
				<PRE STYLE="margin-bottom: 0.5cm">% qrsh -l mem_free=1G error: error: no suitable queues </PRE><P>
				<FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>Yet
				queues are available for batch jobs using qsub, and can be
				queried using qhost -l mem_free=1G and qstat -f -l mem_free=1G.</FONT></FONT></FONT></P>
			</TD>
		</TR>
		<TR VALIGN=TOP>
			<TD WIDTH=50%>
				<P><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>The
				message &quot;error: no suitable queues&quot; results from the
				&quot;-w e&quot; submit option which is active by default for
				interactive jobs like qrsh (look for &quot;-w e&quot; in for
				qrsh(1)). This option causes the submit command to fail, if the
				qmaster does not know for sure that the job will be dispatchable
				according to the current cluster configuration. The intension of
				this mechanism is to decline job requests in advance in case they
				can't be granted.</FONT></FONT></FONT></P>
			</TD>
			<TD WIDTH=50%>
				<P STYLE="margin-bottom: 0cm"><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>In
				this case 'mem_free' is configured to be a consumable resource,
				but you have not specified the amount of memory being available
				at each the host. The memory load values are deliberately not
				considered for this check, because they vary, so they can't be
				seen as part of the cluster configuration. To overcome this you
				can either </FONT></FONT></FONT>
				</P>
				<P STYLE="margin-bottom: 0cm"><BR>
				</P>
				<UL>
					<LI><P STYLE="margin-bottom: 0cm"><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>omit
					this check generally by overriding qrsh's default setting &quot;-w
					e&quot; explicitly by submitting it with &quot;-w n&quot; (can
					also be put into $SGE_ROOT/&lt;cell&gt;/common/sge_request)</FONT></FONT></FONT></P>
					<LI><P STYLE="margin-bottom: 0cm"><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>if
					you intend managing 'mem_free' as a consumbale resource specify
					the 'mem_free' capacity for your hosts in 'complex_values' of
					host_conf(5) by using 'qconf -me &lt;hostname&gt;'</FONT></FONT></FONT></P>
					<LI><P><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>if
					you don't intend managing 'mem_free' as consumable resource make
					it a non-consumable resource again in the 'consumable' column of
					complex(5) by using 'qconf -mc host'</FONT></FONT></FONT></P>
				</UL>
			</TD>
		</TR>
		<TR>
			<TD COLSPAN=2 WIDTH=100% VALIGN=TOP BGCOLOR="#ffcc99">
				<P STYLE="margin-bottom: 0cm"><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>qrsh
				wont dispatch to the same node it is on. From a qsh shell:</FONT></FONT></FONT></P>
				<P STYLE="margin-bottom: 0cm"><BR>
				</P>
				<PRE>host2 [49]% qrsh -inherit host2 hostname
error: executing task of job 1 failed: 

host2 [50]% qrsh -inherit host4 hostname
<FONT COLOR="#000000"><FONT FACE="Cumberland, monospace"><FONT SIZE=2><SPAN LANG="en-US">host4</SPAN></FONT></FONT></FONT></PRE>
			</TD>
		</TR>
		<TR VALIGN=TOP>
			<TD WIDTH=50%>
				<P><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>gid_range
				not sufficient. It should be defined as a range, not a single
				number. SGEEE assigns each job on a host a distinct gid.</FONT></FONT></FONT></P>
			</TD>
			<TD WIDTH=50%>
				<P STYLE="margin-bottom: 0cm"><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>Adjust
				gid_range using 'qconf -mconf' or qmon. The suggested range is:</FONT></FONT></FONT></P>
				<P STYLE="margin-bottom: 0cm"><BR>
				</P>
				<PRE>gid_range                 20000-20100</PRE>
			</TD>
		</TR>
		<TR>
			<TD COLSPAN=2 WIDTH=100% VALIGN=TOP BGCOLOR="#ffcc99">
				<P STYLE="margin-bottom: 0cm"><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>when
				I do a qrsh, I get this error..</FONT></FONT></FONT></P>
				<P STYLE="margin-bottom: 0cm"><BR>
				</P>
				<PRE>% qrsh
error: 1: can't set additional group id for job</PRE>
			</TD>
		</TR>
		<TR VALIGN=TOP>
			<TD WIDTH=50%>
				<P><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>The
				error message below can occur if the user already have 16
				existing group ids set. SGE tries to set one more group id and
				fails b/c usually the limit is 16.</FONT></FONT></FONT></P>
			</TD>
			<TD WIDTH=50%>
				<P><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>Please
				check to see how many group ids are assigned to the user using
				'id -a'. If it's more than 16, then you need to reduce this
				number or increase the limit in the kernel.</FONT></FONT></FONT></P>
			</TD>
		</TR>
		<TR>
			<TD COLSPAN=2 WIDTH=100% VALIGN=TOP BGCOLOR="#ffcc99">
				<P STYLE="margin-bottom: 0cm"><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>qrsh
				-inherit -V does not work when used inside a parallel job:</FONT></FONT></FONT></P>
				<P STYLE="margin-bottom: 0cm"><BR>
				</P>
				<PRE>cannot get connection to &quot;qlogin_starter&quot;</PRE>
			</TD>
		</TR>
		<TR VALIGN=TOP>
			<TD WIDTH=50%>
				<P><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>This
				problem occurs with nested qrsh calls, and is due to the -V
				switch. The first qrsh -inherit call will set the environment
				variable TASK_ID (the id of the tightly integrated task within
				the parallel job). The second qrsh -inherit call will then use
				this environment variable for registration of its task which will
				fail as it tries to start a task with the same id as the already
				running first task.</FONT></FONT></FONT></P>
			</TD>
			<TD WIDTH=50%>
				<P STYLE="margin-bottom: 0cm"><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>You
				can either</FONT></FONT></FONT></P>
				<UL>
					<LI><P STYLE="margin-bottom: 0cm"><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>unset
					TASK_ID before calling qrsh -inherit</FONT></FONT></FONT></P>
					<LI><P><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>not
					use the -V switch, but -v and export only the environment
					variables really needed.</FONT></FONT></FONT></P>
				</UL>
			</TD>
		</TR>
		<TR>
			<TD COLSPAN=2 WIDTH=100% VALIGN=TOP BGCOLOR="#ffcc99">
				<P STYLE="margin-bottom: 0cm"><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>qrsh
				does not seem to work at all:</FONT></FONT></FONT></P>
				<P STYLE="margin-bottom: 0cm"><BR>
				</P>
				<PRE>host2$ qrsh -verbose hostname
local configuration host2 not defined - using global configuration 
waiting for interactive job to be scheduled ...
Your interactive job 88 has been successfully scheduled.
Establishing /share/gridware/utilbin/solaris64/rsh session to host exehost ...
rcmd: socket: Permission denied
/share/gridware/utilbin/solaris64/rsh exited with exit code 1
reading exit code from shepherd ... 
error: error waiting on socket for client to connect: Interrupted system call
error: error reading returncode of remote command
cleaning up after abnormal exit of /share/gridware/utilbin/solaris64/rsh
host2$ </PRE>
			</TD>
		</TR>
		<TR VALIGN=TOP>
			<TD WIDTH=50%>
				<P><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>Permissions
				for qrsh are not set properly</FONT></FONT></FONT></P>
			</TD>
			<TD WIDTH=50%>
				<P STYLE="margin-bottom: 0cm"><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>Check
				the permissions of the following files. They are located in
				$SGE_ROOT/utilbin/. </FONT></FONT></FONT>
				</P>
				<P STYLE="margin-bottom: 0cm"><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>Note
				that rlogin and rsh need to be setuid and owned by root. </FONT></FONT></FONT>
				</P>
				<PRE>-r-s--x--x 1 root root 28856 Sep 18 06:00 rlogin*
-r-s--x--x 1 root root 19808 Sep 18 06:00 rsh*
-rwxr-xr-x 1 sgeadmin adm 128160 Sep 18 06:00 rshd*</PRE><P STYLE="margin-bottom: 0cm">
				<BR>
				</P>
				<P><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>NOTE:
				the $SGE_ROOT directory also needs to be NFS-mounted with the
				&quot;setuid&quot; option. If it is mounted with &quot;nosuid&quot;
				from your submit client, then qrsh (and associated commands) will
				not work.</FONT></FONT></FONT></P>
			</TD>
		</TR>
		<TR>
			<TD COLSPAN=2 WIDTH=100% VALIGN=TOP BGCOLOR="#ffcc99">
				<P><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>Interactive
				jobs fail when run via qsh, without error message.</FONT></FONT></FONT></P>
			</TD>
		</TR>
		<TR VALIGN=TOP>
			<TD WIDTH=50%>
				<P><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>DISPLAY
				variable may be set incorrectly</FONT></FONT></FONT></P>
			</TD>
			<TD WIDTH=50%>
				<P>Set DISPLAY correctly. Or to get error messages for this
				situation - upgrade to release 5.3p2 or higher</P>
			</TD>
		</TR>
		<TR>
			<TH COLSPAN=2 WIDTH=100% VALIGN=TOP BGCOLOR="#ff8080">
				<P><A NAME="qmake"></A>Qmake</P>
			</TH>
		</TR>
		<TR>
			<TD COLSPAN=2 WIDTH=100% VALIGN=TOP BGCOLOR="#ffcc99">
				<P STYLE="margin-bottom: 0cm"><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>When
				trying to start a distributed make qmake exits with the following
				error message:</FONT></FONT></FONT></P>
				<PRE>qrsh_starter: executing child process qmake failed: No such file or directory</PRE>
			</TD>
		</TR>
		<TR VALIGN=TOP>
			<TD WIDTH=50%>
				<P STYLE="margin-bottom: 0cm"><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>Grid
				Engine will start an instance of qmake on the execution host. If
				the Grid Engine environment (esp. the PATH) is not setup in the
				users</FONT></FONT></FONT></P>
				<P><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>shell
				resource file (.profile/.cshrc) this qmake call will fail.</FONT></FONT></FONT></P>
			</TD>
			<TD WIDTH=50%>
				<P STYLE="margin-bottom: 0cm"><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>Use
				the -v option to export the PATH to the qmake job. A typical
				qmake call is</FONT></FONT></FONT></P>
				<PRE>qmake -v PATH -cwd -pe make 2-10 --</PRE>
			</TD>
		</TR>
		<TR>
			<TD COLSPAN=2 WIDTH=100% VALIGN=TOP BGCOLOR="#ffcc99">
				<P STYLE="margin-bottom: 0cm"><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>When
				doing qmake, the error seen is:</FONT></FONT></FONT></P>
				<P STYLE="margin-bottom: 0cm"><BR>
				</P>
				<PRE>waiting for interactive job to be scheduled ...timeout (4 s) 
expired while waiting on socket fd 5

Your &quot;qrsh&quot; request could not be scheduled, try again later.</PRE>
			</TD>
		</TR>
		<TR VALIGN=TOP>
			<TD WIDTH=50%>
				<P STYLE="margin-bottom: 0cm"><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>ARCH
				variable could be set incorrectly in shell which called qmake</FONT></FONT></FONT></P>
				<P STYLE="margin-bottom: 0cm"><BR>
				</P>
				<P><BR>
				</P>
			</TD>
			<TD WIDTH=50%>
				<P STYLE="margin-bottom: 0cm"><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>Set
				ARCH variable correctly to a supported value matching a host
				available in your cluster, or else specify the correct value at
				submit time, eg,</FONT></FONT></FONT></P>
				<PRE>qmake -v ARCH=solaris64 ...</PRE>
			</TD>
		</TR>
		<TR>
			<TH COLSPAN=2 WIDTH=100% VALIGN=TOP BGCOLOR="#ff8080">
				<P><A NAME="qmon"></A>Qmon</P>
			</TH>
		</TR>
		<TR>
			<TD COLSPAN=2 WIDTH=100% VALIGN=TOP BGCOLOR="#ffcc99">
				<P STYLE="font-weight: medium">Why am I am having refresh
				problems when running Qmon? 
				</P>
			</TD>
		</TR>
		<TR VALIGN=TOP>
			<TD WIDTH=50%>
				<P><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>Your
				system needs to be updated with the proper Xserver patch.</FONT></FONT></FONT></P>
			</TD>
			<TD WIDTH=50%>
				<P><FONT FACE="Thorndale, serif"><FONT COLOR="#000000">X-server
				Patch <A HREF="http://sunsolve.Sun.COM/pub-cgi/show.pl?target=patches/patch-access" TARGET="_child">105633-48</A>
				fixes the problem of the icons not refreshing properly as well as
				icons painting over non-qmon windows. It is important to install
				the patch in console mode (no X running) otherwise the patch
				installation will fail</FONT></FONT></P>
			</TD>
		</TR>
		<TR>
			<TH COLSPAN=2 WIDTH=100% VALIGN=TOP BGCOLOR="#ff8080">
				<P><A NAME="pe-ckpt"></A>Parallel/Checkpointing</P>
			</TH>
		</TR>
		<TR>
			<TD COLSPAN=2 WIDTH=100% VALIGN=TOP BGCOLOR="#ffcc99">
				<P STYLE="font-weight: medium">Parts of Sun HPC ClusterTools
				parallel jobs (job script itself, child processes, etc) fail to
				stop when terminated by user or by qmaster.</P>
			</TD>
		</TR>
		<TR VALIGN=TOP>
			<TD WIDTH=50%>
				<P><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>The
				user may not have supplied the necessary means (scripts) for SGE
				to control the distributed jobs.</FONT></FONT></FONT></P>
			</TD>
			<TD WIDTH=50%>
				<P>Follow the complete <A HREF="hpc.html">HOW-TO instructions</A>
				on Integration between Grid Engine and HPC Cluster Tools.</P>
			</TD>
		</TR>
		<TR VALIGN=TOP>
			<TD WIDTH=50%>
				<P><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>Bugs
				in early versions of loose integration package</FONT></FONT></FONT></P>
			</TD>
			<TD WIDTH=50%>
				<P>Update to SGE 5.3p2 (or higher) which includes latest MPI
				loose integration package</P>
			</TD>
		</TR>
		<TR>
			<TD COLSPAN=2 WIDTH=100% VALIGN=TOP BGCOLOR="#ffcc99">
				<P>Parallel jobs that run with the tight integration of SGE5.3.x
				and HPC CT 5 are not terminated if one of the queues has wall
				clock limit set.</P>
			</TD>
		</TR>
		<TR VALIGN=TOP>
			<TD WIDTH=50%>
				<P><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>A
				bug in SGE prevented correct signal delivery to all parallel
				processes</FONT></FONT></FONT></P>
			</TD>
			<TD WIDTH=50%>
				<P>SGE 5.3p4 contains the fix; for earlier 5.3.x versions, get
				corresponding patches from <A HREF="http://sunsolve.Sun.COM/pub-cgi/show.pl?target=patches/patch-access">Sunsolve</A>:</P>
				<P>SGE: 113136-04 (pkgadd Solaris 32-bit); 113137-04 (pkgadd
				Solaris 64-bit); 113138-04 (pkgadd Solaris X86); 113663-02
				(pkgadd common pkg); 113849-03 (tar.gz Solaris 32-bit); 113850-03
				(tar.gz Solaris 64-bit); 113851-03 (tar.gz Solaris X86);
				113852-04 (tar.gz Linux); 113853-02 (tar.gz common package)</P>
				<P>SGEEE: 113139-04 (pkgadd Solaris 32-bit); 113140-04 (pkgadd
				Solaris 64-bit); 113636-03 (pkgadd common pkg); 113855-03 (tar.gz
				Solaris 32-bit); 113856-03 (tar.gz Solaris 64-bit); 113900-02
				(tar.gz Linux); 113857-02 (tar.gz common package)</P>
			</TD>
		</TR>
		<TR>
			<TD COLSPAN=2 WIDTH=100% VALIGN=TOP BGCOLOR="#ffcc99">
				<P>Parallel jobs that run with the tight integration of SGE5.3.x
				and HPC CT 5 would not suspend and resume correctly.</P>
			</TD>
		</TR>
		<TR VALIGN=TOP>
			<TD WIDTH=50%>
				<P><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>Another
				bug in SGE prevented STOP and CONT signals to be correctly
				delivered to all processes. </FONT></FONT></FONT>
				</P>
			</TD>
			<TD WIDTH=50%>
				<P>Need to set the suspend/resume methods in the queues used for
				the parallel jobs with the appropriate scripts. These scripts can
				either be downloaded from the Grid Engine Project site <A HREF="http://gridengine.sunsource.net/source/browse/gridengine/source/dist/mpi/sunhpc/tight-integration/">at
				this link</A> or obtained from Sun support.</P>
				<P>Releases beyond 5.3p4 will ship with these two scripts, a
				README file and a parallel environment template.</P>
			</TD>
		</TR>
		<TR>
			<TH COLSPAN=2 WIDTH=100% VALIGN=TOP BGCOLOR="#ff8080">
				<P><A NAME="shadow"></A>Shadow Facility</P>
			</TH>
		</TR>
		<TR>
			<TD COLSPAN=2 WIDTH=100% VALIGN=TOP BGCOLOR="#ffcc99">
				<P STYLE="font-weight: medium">After failover to shadow master,
				the schedd daemon remains running on the original qmaster</P>
			</TD>
		</TR>
		<TR VALIGN=TOP>
			<TD WIDTH=50%>
				<P><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>This
				is a bug in earlier versions of SGE.</FONT></FONT></FONT></P>
			</TD>
			<TD WIDTH=50%>
				<P>Update to 5.3p2 or higher</P>
			</TD>
		</TR>
		<TR>
			<TD COLSPAN=2 WIDTH=100% VALIGN=TOP BGCOLOR="#ffcc99">
				<P STYLE="font-weight: medium">Shadow host fails to own
				mastership of SGE cluster</P>
			</TD>
		</TR>
		<TR VALIGN=TOP>
			<TD WIDTH=50%>
				<P><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>Lock
				file exists.</FONT></FONT></FONT></P>
			</TD>
			<TD WIDTH=50%>
				<P>Remove $SGE_ROOT/&lt;cell&gt;/spool/qmaster/lock file if
				master host has crashed or can no longer function as
				qmaster.<BR><B>NOTE:</B> to force the shadow host to take over
				from another master, use the &ldquo;migrate&rdquo; option, ie,
				&ldquo;rcsge -migrate&rdquo;.</P>
			</TD>
		</TR>
		<TR VALIGN=TOP>
			<TD WIDTH=50%>
				<P><FONT COLOR="#000000"><FONT FACE="Thorndale, serif"><FONT SIZE=3>Root
				R/W access to $SGE_ROOT directory and its sub-directories should
				be from both master and shadow.</FONT></FONT></FONT></P>
			</TD>
			<TD WIDTH=50%>
				<P STYLE="margin-bottom: 0cm">Adjust permissions for root r/w
				access to the $SGE_ROOT directory and its sub-directories from
				shadow host.</P>
				<P><B>NOTE: </B>please see the <A HREF="http://gridengine.sunsource.net/project/gridengine/howto/shadow.html">Shadow
				Master HOWTO</A></P>
			</TD>
		</TR>
	</TBODY>
</TABLE>
<P STYLE="margin-bottom: 0cm"><BR>
</P>
</BODY>
</HTML>